Interim Report - Capstone Project

Group - Capstone NLP - 2 Group - 2

Mentor - Mr. Rohit Raj

Members

  1. John Cherian
  2. Kiran Bobba
  3. Ajay
  4. Shankhadeep
  5. Mario Mathew

Part 1

Summary of problem statement, data and findings

1.1 Problem Statement

For industries around the world, workplace accidents are a major concern: they affect the lives and well-being of employees, contractors, and their families, and the industry faces losses in terms of hospital charges, litigation fees, reputation, and lost employee morale. Based on these facts, we intend to build a chatbot that can highlight the safety risk indicated by an incident description to professionals including:

1. Personnel from the safety and compliance team

2. Senior management from the plant

3. Personnel from other plants across the globe

4. Government and industrial safety groups

5. Anyone interested in or doing research on industrial safety

6. Emergency health and safety teams

7. Fire safety and industrial hazard teams

8. General management

9. Other personnel requiring safety risk information

so that these professionals can:

- Take preventive and proactive measures based on past history
- React faster to employee concerns related to safety
- Help position equipment and machinery in safe places where the risk of potential accidents can be minimised
- Gain insights about safety in industries where safety is paramount
- Reduce insurance costs through better handling of personnel, equipment and other resources
- Take other safety-related decisions and actions

1.2 Outcome

The user should be able to input an incident description, and the chatbot should predict the potential accident or vulnerability level; this capability can be extended or configured for different scenarios.

1.3 The Data

The dataset describes accident incidents from twelve different plants across three different countries and consists of 425 records. It has the following columns:

Date: timestamp or time/date information

Countries: Which country the accident occurred (anonymised)

Local: The city where the manufacturing plant is located (anonymised)

Industry sector: Which sector the plant belongs to

Accident level: From I to VI, it registers how severe the accident was (I means not severe, VI means very severe)

Potential Accident Level: From I to VI, depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)

Gender: Whether the person involved is male or female

Employee or Third Party: If the injured person is an employee or a third party / contractor

Critical Risk: Description of the risk involved in the accident

Description: Detailed description of how the accident happened

1.4 Summary of findings and implications

On inspection of the dataset it appears that:

1. The dataset is limited, consisting of only 425 records, so training models to high accuracy could be a challenge

2. The dataset is imbalanced on certain variables such as potential accident level and accident level; this means we may not get consistent results unless the dataset is treated to reduce the imbalance

3. Minor accidents are more common than major accidents, which mirrors real-world situations

4. There is data from three countries

5. There are twelve locals or cities from which the data is taken

6. There are three industry sectors - mining, metals, and all others grouped together as others

7. There are five accident levels

8. There are six potential accident levels

9. There are employees, third parties, and remote third parties involved in the accidents

10. There are thirty-three different types of critical risk, one of which is assigned to each accident incident

11. The accident description is highly unclean, so it will require considerable effort to clean before it can produce results

12. The dataset consists of data from January 2016 to July 2017

13. Males are involved in accidents more than females, which also mirrors real-world situations, as considerably fewer females work in industrial environments
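The class imbalance noted in point 2 can be quantified before modelling, and inverse-frequency class weights derived for later training. A minimal sketch (the label counts below are illustrative stand-ins, not the dataset's real distribution):

```python
from collections import Counter

# Illustrative accident-level labels summing to the dataset's 425 records
# (the counts here are made up to demonstrate the computation).
labels = ["I"] * 300 + ["II"] * 60 + ["III"] * 40 + ["IV"] * 20 + ["V"] * 5

counts = Counter(labels)
total = sum(counts.values())

# Inverse-frequency class weights: rare classes get larger weights,
# which many classifiers accept via a class_weight parameter.
class_weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

for cls in sorted(counts):
    print(cls, counts[cls], round(class_weights[cls], 2))
```

The same computation applies to the real `Accident Level` and `Potential Accident Level` columns once the data is loaded.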

Part 2

2.1 Summary of the Approach to EDA and Pre-processing

Approach - We have agreed on designing a chatbot using Slack as the UI, integrated with RASA and an API that triggers the underlying NLP model that gets built.

We have established agreed intermediate goals and progressed through the process steps below.

As part of the NLP model building we have adopted the following process steps:

- Data processing techniques: data cleansing, feature engineering, lemmatizing, stemming, removing stop words, and GloVe embedding
- Data visualization with charts, to see clearly how the data is spread across different dimensions, using univariate, bivariate, and multivariate analysis
- Model designing - as part of model designing we have designed and trained the models below:

Random Forest, Gradient Boosting, Logistic Regression, and SVM, and neural network classifiers such as

RNN, LSTM and Bi-directional LSTM, and FastText. We are fine-tuning and evaluating the best performing model to be shipped behind the API that gets triggered from the Slack user interface.

Findings - from the data analysis we could infer that:

- Many body-related actions and accidents have been found
- A lot of equipment-related accidents are cited in the dataset
- Poor features, with a lack of quality or inadequate data, result in class imbalance

Since the data shows that the recorded accident severity is often low even for critical risks, we will have to consider both the Accident Level and the Potential Accident Level for the model prediction.
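The data-cleansing step mentioned in the approach (lowercasing, removing punctuation and stop words before embedding) can be sketched as follows; the stop-word list and regex here are simplified stand-ins, and a real pipeline would use NLTK or spaCy for the full stop-word set, lemmatizing, and stemming:

```python
import re

# A tiny illustrative stop-word list; a real pipeline would use
# NLTK's or spaCy's full English stop-word set.
STOP_WORDS = {"the", "a", "an", "was", "is", "of", "and", "to", "in"}

def clean_description(text: str) -> str:
    """Lowercase, strip non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and spaces only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_description("The employee was struck by a moving vehicle in Sector 3."))
```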

Data Preparation for Time Series Analysis

Replacing categorical values
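Replacing categorical values with numeric codes can be sketched as a simple mapping; the Roman-numeral levels follow the dataset description, while the helper name is our own:

```python
# Map Roman-numeral accident levels to ordinal integers so that
# models can treat severity as an ordered quantity.
LEVEL_MAP = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

def encode_levels(levels):
    """Replace each categorical level with its ordinal code."""
    return [LEVEL_MAP[lvl] for lvl in levels]

print(encode_levels(["I", "IV", "II"]))
```

In practice the same replacement would be applied column-wise, e.g. with pandas `map`, to the Accident Level and Potential Accident Level columns.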

Exploratory Data Analysis

Univariate Analysis

- Country_01 is the most affected country, accounting for 251 accidents; Country_03 is the least affected, accounting for 44 accidents in the dataset.
- Local_3 is the most affected city, accounting for 90 accidents, and it belongs to Country_01. Local_11 is the least affected city, accounting for 2 accidents, and it belongs to Country_01. Local_09 and Local_12 are also among the least affected cities, accounting for 2 and 4 accidents respectively, and they belong to Country_02.
- Most accidents happened in the mining industry sector; its count is 241, making mining the most affected sector.
- Accident Level 1 is the most frequent accident level and Accident Level 5 is the least frequent of all the accidents in the dataset.
- Most "Potential Accident Level" records belong to level 4; its count is 143. The fewest belong to level 5; its count is 32.
- Male is the most affected gender and female is the least; the counts are 403 and 22 accidents respectively.
- The most affected employee type is Third Party, with 189 accidents. Third Party (Remote) is the least, with 57.
- Most of the Critical Risk values belong to the Others class; its count is 232, possibly because in real life many accident causes are not disclosed.
- The first quarter is the most affected quarter, accounting for 154 accidents, and the fourth quarter is the least affected, with 58 accidents. Country_01 accounts for 59% of accidents and Country_03 for 10%.
- The mining sector accounts for 57% of the total accidents; Others is the least affected industry, accounting for 12% of the total accidents.

Bivariate and Multivariate Analysis

Country_01 is the most affected country, and most of the classes of Potential Accident Level belong to Country_01.

Country_01 is the most affected country, and most of the classes of Accident Level belong to Country_01.

The mining sector is the most affected, and the severity levels of accidents also belong to the same sector.

The first and second quarters account for the higher accident levels, i.e. levels 4 and 5.

Most of the classes of Potential Accident Level come from the Others class of Critical Risk, which numbers 232.

The more severe Potential Accident Levels come from the classes Fall, Electrical installation, Vehicles, Projection, Pressed, and Mobile equipment.

The mining sector is the most affected sector, and most of the classes of Critical Risk come from this sector.

Accident Level vs Potential Accident Level

```python
import plotly.express as px

fig = px.histogram(ds, color='Potential Accident Level', x='Accident Level',
                   width=800, height=500)
fig.update_layout(bargap=0.2)
fig.show()
```

Class 1 of Accident Level accounts for most of the accidents and spans all the classes of Potential Accident Level (1, 2, 3, 4, 5).

Third Party and Employee are the most affected employee types.

Males are the most affected gender, with Potential Accident Levels 4 and 5 coming from the mining sector.

Local_3 is the most affected city, and the most affected classes of employee type are Third Party and Employee.

Local_3 has the highest number of mining industry sector accidents.

Local_5 has the highest number of metals industry sector accidents.

All the mining industry sector accidents happened in Locals 1, 2, 3, 4, and 7.

All the metals industry sector accidents happened in Locals 5, 6, 8, and 9.

All the Others industry sector accidents happened in Locals 10, 11, and 12.

Most of the accidents happened in 2016, with fewer in 2017.

Most of the mining accidents happened in 2016, with fewer in 2017.

Part 3

3.1 Deciding Models and Model Building

Design, train and test with various classifiers

Text Data Cleaning

Analysing the text variable

Design, train and test machine learning classifiers

Random Forest model Training and Evaluation

Gradient Boosting model for Training and Evaluation

Logistic Regression model for Training and Evaluation

Linear SVC model for Training and Evaluation
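The four machine learning classifiers above share the same pipeline shape: vectorize the cleaned descriptions, then fit and evaluate a classifier. A minimal scikit-learn sketch using Logistic Regression as the example (the toy texts and labels are stand-ins for the real `Description` and `Accident Level` columns):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the cleaned Description column and Accident Level labels.
texts = ["worker fell from ladder", "chemical spill in storage",
         "hand caught in press", "minor slip on wet floor"] * 5
labels = ["IV", "III", "V", "I"] * 5

# class_weight='balanced' counteracts the class imbalance noted in the EDA.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(texts, labels)

print(model.predict(["employee fell from scaffolding"]))
```

Swapping `LogisticRegression` for `RandomForestClassifier`, `GradientBoostingClassifier`, or `LinearSVC` gives the other three models with the same vectorization step.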

Design, train and test Neural networks classifiers

Pad Sequences - for Train and Test

Create a weight matrix using GloVe embeddings

Embedding Layer gives us 3D output -> [Batch_Size , Review Length , Embedding_Size]
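The two steps above — padding the tokenized descriptions to a fixed length and assembling a GloVe weight matrix for the embedding layer — can be sketched with NumPy. The tiny in-memory `glove` dict stands in for a real GloVe file, and the vocabulary and embedding dimension are illustrative:

```python
import numpy as np

# Toy stand-in for pre-trained GloVe vectors (real ones are 50-300 dims,
# loaded from a glove.*.txt file).
glove = {"worker": np.array([0.1, 0.2]), "fell": np.array([0.3, 0.1])}
embedding_dim = 2

# Word index as produced by a tokenizer: word -> integer id (0 = padding).
word_index = {"worker": 1, "fell": 2, "ladder": 3}

# Weight matrix: row i holds the GloVe vector for word id i;
# out-of-vocabulary words (like "ladder") stay at zero.
weights = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in glove:
        weights[i] = glove[word]

# Pad integer sequences to a fixed length (post-padding with 0s).
def pad(seq, maxlen):
    return (seq + [0] * maxlen)[:maxlen]

padded = [pad([1, 2, 3], 5), pad([2, 1], 5)]
print(weights.shape, padded)
```

The `weights` matrix is what would be passed as the (frozen or trainable) initializer of the Keras `Embedding` layer, giving the 3D output shape noted above.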

The neural network model is not learning well; accuracy is 40%.

Design, train and test RNN or LSTM classifiers

Bi-directional LSTM works best, with an accuracy of 51%. The model needs more data cleaning.

FastText Model

With hyperparameter tuning via the code below, covering epochs, learning rate, wordNgrams, hierarchical softmax, and multi-label (just tried)

Training with 300 epochs for both Accident and Potential Accident levels

```python
ft_model_Potential.test("/home/mario/Great learning/RASA_GL/temp/valid_Potential.valid")
```

With WordNGrams

Adding more than 1 wordNgram decreases the accuracy, so there is no improvement from adding the wordNgrams hyperparameter

With hierarchical softmax - adding this hyperparameter causes the accuracy to drop, so it is not helpful

So far FastText gives an accuracy of:

Potential Accident Level - 43%

Fine tuning the Model and the approach-

The challenge at hand is that we do not have a large dataset; ours has only 425 records.

One of the main reasons for not achieving very high accuracy could be the lack of large labeled text datasets. Most of the labeled text datasets are not big enough to train deep neural networks because these networks have a huge number of parameters and training such networks on small datasets will cause overfitting.

We are also aware that NLP models are typically more shallow and thus require different fine-tuning methods.

BERT (Bidirectional Encoder Representations from Transformers) is a big neural network architecture with millions of parameters, so training a BERT model from scratch on a small dataset would result in overfitting.

So, we propose to use a pre-trained BERT model that was trained on a huge dataset, as a starting point and then we can further train the model on our relatively smaller dataset.
We will be exploring different Fine-Tuning Techniques mentioned below in the weeks to come
Train the entire architecture – We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer. In this case, the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated based on the new dataset.
Train some layers while freezing others – Another way to use a pre-trained model is to train it partially. What we can do is keep the weights of initial layers of the model frozen while we retrain only the higher layers. We can try and test as to how many layers to be frozen and how many to be trained.
Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.

We will probably use this last approach. We will freeze all the layers of BERT during fine-tuning and append a dense layer and a softmax layer to the architecture.
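The freezing strategy can be sketched in PyTorch; the small `nn.Sequential` below is only a stand-in for the pre-trained BERT encoder (which would come from the `transformers` library), and the layer sizes are illustrative:

```python
import torch.nn as nn

# Stand-in for a pre-trained encoder; in practice this would be
# BertModel.from_pretrained(...) from the transformers library.
encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

# Freeze the entire pre-trained encoder: its weights will not be updated.
for param in encoder.parameters():
    param.requires_grad = False

# Append our own trainable head: a dense layer and a softmax over the
# six potential accident levels.
head = nn.Sequential(nn.Linear(768, 50), nn.ReLU(),
                     nn.Linear(50, 6), nn.LogSoftmax(dim=-1))

model = nn.Sequential(encoder, head)

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # counts only the head's parameters
```

During training, only the head's parameters receive gradient updates; the frozen encoder acts as a fixed feature extractor.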

Adjusting Hyperparameters -

We want to build a model that performs robustly, and to this effect we use the same set of hyperparameters across tasks and validation sets. We shall also explore the AWD-LSTM language model (Merity et al., 2017a) with an embedding size of 400, 3 layers, 1150 hidden activations per layer, and a BPTT batch size of 70. We apply dropout of 0.4 to layers, 0.3 to RNN layers, 0.4 to input embedding layers, 0.05 to embedding layers, and weight dropout of 0.5 to the RNN hidden-to-hidden matrix. The classifier has a hidden layer of size 50.

We use Adam with β1 = 0.7 instead of the default β1 = 0.9, and β2 = 0.99. We use a batch size of 64, and base learning rates of 0.004 and 0.01 for fine-tuning the language model and the classifier respectively.
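The hyperparameters above can be collected into a single configuration dict so experiments stay reproducible; the key names are our own choice, while the values follow the plan above:

```python
# Hyperparameters from the fine-tuning plan, gathered in one place.
# Key names are our own; values follow the text above.
HPARAMS = {
    "embedding_size": 400,
    "num_layers": 3,
    "hidden_activations": 1150,
    "bptt_batch_size": 70,
    "dropout_layers": 0.4,
    "dropout_rnn": 0.3,
    "dropout_input_embedding": 0.4,
    "dropout_embedding": 0.05,
    "weight_dropout_hh": 0.5,
    "classifier_hidden": 50,
    "adam_beta1": 0.7,
    "adam_beta2": 0.99,
    "batch_size": 64,
    "lr_language_model": 0.004,
    "lr_classifier": 0.01,
}

print(HPARAMS["adam_beta1"], HPARAMS["lr_classifier"])
```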

We are hopeful that by adopting these fine-tuning techniques we will be able to achieve high accuracy for the final model that we deploy.